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Q-LEARNING WITH CENSORED DATA 

By Yair Goldberg and Michael R. Kosorok 1 

University of North Carolina at Chapel Hill 

We develop methodology for a multistage decision problem with 
flexible number of stages in which the rewards are survival times that 
are subject to censoring. We present a novel Q-learning algorithm 
that is adjusted for censored data and allows a flexible number of 
stages. We provide finite sample bounds on the generalization error of 
the policy learned by the algorithm, and show that when the optimal 
Q-function belongs to the approximation space, the expected survival 
time for policies obtained by the algorithm converges to that of the 
optimal policy. We simulate a multistage clinical trial with flexible 
number of stages and apply the proposed censored-Q-learning algo- 
rithm to find individualized treatment regimens. The methodology 
presented in this paper has implications in the design of personalized 
medicine trials in cancer and in other life-threatening diseases. 

1. Introduction. In medical research, dynamic treatment regimes are in- 
creasingly used to choose effective treatments for individual patients with 
long-term patient care. A dynamic treatment regime (or similarly, policy) 
is a set of decision rules for how the treatment should be chosen at each 
decision time-point, depending on both the patient's medical history up to 
the current time-point and the previous treatments. Note that although the 
same set of decision rules is applied to all patients, the choice of treatment 
at a given time-point may differ, depending on the patient's medical state. 
Moreover, the patient's treatment plan is not known at the beginning of 
a dynamic regime, since it may depend on subsequent time- varying variables 
that may be influenced by earlier treatments and response to treatment. An 
optimal treatment regime is a set of treatment choices that maximizes the 
mean response of some clinical outcome at the end of the final time interval 
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[see, e.g., Murphy (2003), Robins (2004), Moodie, Richardson and Stephens 
(2007)]. 

We consider the problem of finding treatment regimes that lead to longer 
survival times, where the number of treatments is flexible and where the 
data are subject to censoring. This type of framework is natural for cancer 
applications, where the initiation of the next line of therapy depends on the 
disease progression and thus the number of treatments is flexible. In addi- 
tion, data are subject to censoring since patients can drop out during the 
trial. For example, in advanced nonsmall cell lung cancer (NSCLC), patients 
receive one to three treatment lines. The timing of the second and third lines 
of treatment is determined by the disease progression and by the ability of 
patients to tolerate therapy [Stinchcombe and Socinski (2008), Krzakowski 
et al. (2010)]. We focus on mean survival time restricted to a specific in- 
terval, since in a limited-time study, censoring prevents reliable estimation 
of the unrestricted mean survival time [see discussion in Karrison (1997), 
Zucker (1998), Chen and Tsiatis (2001); see also Wahed and Tsiatis (2006) 
in the context of sequential decision problems and see Robins, Orellana and 
Rotnitzky (2008) for an alternative approach]. 

Finding an optimal policy for survival data poses many statistical chal- 
lenges. We enumerate four. First, one needs to incorporate information ac- 
crued over time into the decision rule. Second, one needs to avoid treatments 
which appear optimal in the short term but may lead to poor final outcome 
in the long run. Third, the data are subject to censoring since some of the 
patients may be lost to follow-up and the final outcome of those who reached 
the end of the study alive is unknown. Fourth, the number of decision points 
(i.e., treatments) and the timing of these decision points can be different for 
different patients. This follows since the number of treatments and duration 
between treatments may depend on the medical condition of the patient. In 
addition, in the case of a patient's death, treatment is stopped. The first 
two challenges are shared with general multistage decision optimization [La- 
vori and Dawson (2004), Moodie, Richardson and Stephens (2007)]. The 
latter two arise naturally in the context of optimizing survival time, but 
are applicable to other scenarios as well. Developing valid methodology for 
estimating dynamic treatment regimes in this flexible timing setup is crucial 
for applications in cancer and in other diseases where such structure is the 
norm and appropriate existing methods are unavailable. 

One of the primary tools used in developing dynamic treatment regimes 
is Q-learning [Murphy et al. (2007), Zhao, Kosorok and Zeng (2009), Laber 
et al. (2010), Zhao et al. (2011)]. Q-learning [Watkins (1989), Watkins and 
Dayan (1992)], which is reviewed in Section 2, is a reinforcement learning al- 
gorithm. Since we do not assume that the problem is Markovian, we present 
a version of Q-learning that uses backward recursion. The backward recur- 
sion used by Q-learning addresses the first two challenges posed above: it 
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enables both accrual of information and incorporation of long-term treat- 
ment effects. However, when the number of stages is flexible, and censoring 
is introduced, it is not clear how to implement backward recursion. Indeed, 
finding the optimal treatment at the last stage is not well defined, since the 
number of stages is patient-dependent. Also, it is not clear how to utilize 
the information regarding censored patients. 

In this paper we present a novel Q-learning algorithm that takes into 
account the censored nature of the observations using inverse-probability-of- 
censoring weighting [see Robins, Rotnitzky and Zhao (1994); see also Wahed 
and Tsiatis (2006), Robins, Orellana and Rotnitzky (2008) in the context 
of sequential decision problems]. We provide finite sample bounds on the 
generalization error of the policy learned by the algorithm, that is, bounds 
on the average difference in expected survival time between the optimal 
dynamic treatment regime and the dynamic treatment regime obtained by 
the proposed Q-learning algorithm. We also present a simulation study of 
a sequential-multiple-assignment randomized trial (SMART) [see Murphy 
(2005a) and references therein] with flexible number of stages depending 
on disease progression and failure event timing. We demonstrate that the 
censored- Q-learning algorithm proposed here can find treatment strategies 
tailored to each patient which are better than any fixed-treatment sequence. 
We also demonstrate the result from ignoring censored observations. 

One general contribution of the paper is the development of a methodol- 
ogy for solving backward recursion when the number and timing of stages 
are flexible. As mentioned previously, this is crucial for applications but has 
not been addressed previously. In Section 4 we present an auxiliary multi- 
stage decision problem that has a fixed number of stages. Since the number 
of stages is fixed for the auxiliary problem, backward recursion can be used 
in order to estimate the decision policy. We then show how to translate the 
original problem to the auxiliary one and obtain the surprising conclusion 
that results obtained for the auxiliary problem can be translated into results 
regarding the original problem with flexible number and timing of stages. 

An additional contribution of the paper is the universal consistency proof 
for the algorithm performance. Universal consistency of an algorithm means 
that for every distribution function on the sample space, the expected loss 
of the function learned by the algorithm converges in probability to the 
infimum of the expected loss, where the infimum is taken over all the mea- 
surable functions [see, e.g., Steinwart and Christmann (2008)]. In Section 6 
we prove that when the optimal Q-functions belong to the corresponding 
approximation spaces considered by the algorithm, the algorithm is univer- 
sally consistent. The proof presented here is algorithm-specific, but the tools 
used in the proof are widely applicable for universal consistency proofs when 
the data are subject to censoring [see, e.g., Goldberg and Kosorok (2012)]. 
While other learning algorithms were suggested for survival data [see, e.g., 
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Biganzoli et al. (1998), Shivaswamy, Chu and Jansche (2007), Shim and 
Hwang (2009); see also Zhao et al. (2011) in the context of a multistage 
decision problem] , we are not aware of any other universal consistency proof 
for survival data. 

The paper is organized as follows. In Section 2 we review the Q-learning 
algorithm and discuss the challenges for adapting the Q-learning methodol- 
ogy for a framework with flexible number of stages and censored data. We 
also review existing methods for finding optimal policies. Definitions and 
notation are presented in Section 3. The auxiliary problem is presented in 
Section 4. The censored-Q-learning algorithm is presented in Section 5. The 
main theoretical results are presented in Section 6. In Section 7 we present 
a multistage-randomized-trial simulation study. Concluding remarks appear 
in Section 8. Supplementary proofs are provided in the Appendix. A descrip- 
tion of and link to the code and data sets used in Section 7 appear in the 
supplementary material [Goldberg and Kosorok (2012)]. 

2. Q-learning. 

2.1. Reinforcement learning. Reinforcement learning is a methodology 
for solving multistage decision problems. It involves recording sequences of 
actions, statistically estimating the relationship between these actions and 
their consequences and then choosing a policy (i.e., a set of decision rules) 
that approximates the most desirable consequence based on the statistical 
estimation. A detailed introduction to reinforcement learning can be found 
in Sutton and Barto (1998). 

In the medical context of long-term patient care, the reinforcement learn- 
ing setting can be described as follows. For each patient, the stages corre- 
spond to clinical decision points in the course of the patient's treatment. 
At these decision points, actions (e.g., treatments) are chosen, and the state 
of the patient is recorded. As a consequence of a patient's treatment, the 
patient receives a (random) numerical reward. 

More formally, consider a multistage decision problem with T decision 
points. Let St be the (random) state of the patient at stage t G {1, . . . , T + 1} 
and let St = {S±,...,St} be the vector of all states up to and including 
stage t. Similarly, let At be the action chosen in stage t, and let = 
{Ai, . . . ,At} be the vector of all actions up to and including stage t. We 
use the corresponding lower case to denote a realization of these random 
variables and random vectors. Let the random reward be denoted Rt = 
r(St,At,St+i), where r is an (unknown) time-dependent deterministic func- 
tion of all states up to stage t + 1 and all past actions up to stage t. A tra- 
jectory is defined as a realization of (St+i, At, Rt)- Note that we do not 
assume that the problem is Markovian. In the medical context example, 
a trajectory is a record of all the patient covariates at the different deci- 
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sion points, the treatments that were given, and the medical outcome in 
numerical terms. 

We define a policy, or similarly, a dynamic treatment regime, to be a set 
of decision rules. More formally, define a policy ir to be a sequence of de- 
terministic decision rules, {iri, . . . ,wt}, where for every pair (st,SLt-i), the 
output of the tth decision rule, nt(st,SLt-i), is an action. Our goal is to find 
a policy that maximizes the expected sum of rewards. The Bellman equation 
[Bellman (1957)] characterizes the optimal policy n* as one that satisfies the 
following recursive relation: 

(1) 7r t *(s t ,a t _i) = argmaxE[R t + V t * +1 (S t+1 , A*)|S f = s u A t = a t ], 

at 

where the value function 



(2) V* 1 (s t +i,^) = E, 



T 

Ri St+i = st+i, At = a* 

i=t+i 



is the expected cumulative sum of rewards from stage t + 1 to stage T, 
where the history up to stage t + 1 is given by {st + i,a(}, and when using 
the optimal policy n* thereafter. 

Finding a policy that leads to a high expected cumulative reward is the 
main goal of reinforcement learning. Naively, one could learn the transition 
distribution functions and the reward function using the observed trajecto- 
ries, and then solve the Bellman equation recursively. However, this approach 
is inefficient both computationally and memory-wise. In the following sec- 
tion, we introduce the Q-learning algorithm, which requires less memory and 
less computation. 

2.2. Q-learning. Q-learning [Watkins (1989)] is an algorithm for solving 
reinforcement learning problems. It is claimed by Sutton and Barto to be one 
of the most important breakthroughs in reinforcement learning [Sutton and 
Barto (1998), Section 6.5]. Q-learning uses backward recursion to compute 
the Bellman equation without the need to know the full dynamics of the 
process. 

More formally, we define the optimal time-dependent Q-function 
Qt*(st,a t ) = E[Rt + V t * +1 (S t+1 ,A t )\S t = s t ,A t = a*]. 
Note that V* (s t ,a.t-i) = max fli Q* t (st,a t ), and thus 

(3) Qt(s t ,a t ) = E R t + maxQ* t+1 (S t+ i,A t ,a t+ i)\St = s t ,A t = a t 

l a t +i 

In order to estimate the optimal policy, one first estimates the Q-functions 
backward through time t = T, T — 1, . . . , 1 and obtains a sequence of estima- 
tors {Qt, ■ ■ ■ ) Qi}- The estimated policy is given by 

(4) 7r t (s t ,at_i) = argmaxQ t (s t ,a t _i,a t ). 

at 
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In the next section we discuss the difficulties in applying the Q-learning 
methodology when trajectories are subject to censoring and the number of 
stages is flexible. 

2.3. Challenges with flexible number of stages and censoring. As dis- 
cussed in the Introduction, our goal is to develop a Q-learning algorithm 
that can handle a flexible number of stages and that takes into account the 
censored nature of the observations. We face two main challenges. First, re- 
call that the estimation of the Q-functions in (3) is done recursively, starting 
from the last stage backward. Thus, when the number of stages is flexible, 
it is not clear how to perform the base step of the recursion. Second, due 
to censoring, some of the trajectories may be incomplete. Incorporating the 
data of a censored trajectory is problematic: even when the number of stages 
is fixed, the known number of stages for a censored trajectory may be less 
than the number of stages in the multistage problem. Moreover, the reward 
is not known for the stage at which censoring occurs. 

2.4. Review of existing approaches. Finding optimal policies or optimal 
treatment regimes has been discussed extensively in other work. We discuss 
shortly some additional work that is related to the approach taken here. 
However, we are not aware of any other existing approaches that address 
simultaneously both censoring and flexible number of stages. 

The approach closest to our proposal is the censored-Q-learning algo- 
rithm of Zhao et al. (2011). Zhao et al. considered a Q-learning algorithm 
for censored data based on support vector regression adjusted for censoring 
with fixed number of stages. A simulation study was performed to demon- 
strate the algorithm's performance; however, the theoretical properties of 
this algorithm were not evaluated. 

A general approach for finding optimal policies that uses backward recur- 
sion was studied by Murphy (2003) and Robins (2004) in the semiparametric 
context, and by Murphy (2005b) in the nonparametric context. These works 
do not treat flexible number of stages or censoring, and cannot be applied 
to the framework considered here without some adjustments. 

Another approach for finding optimal policies was studied by Orellana, 
Rotnitzky and Robins (2010) [see also van der Laan and Petersen (2007), 
Robins, Orellana and Rotnitzky (2008)]. Orellana et al. considered dynamic 
regime marginal structural mean models [Robins (1999)]. In this approach, 
for each regime, one considers all trajectories that comply to the regime up 
to some point. The trajectories are then censored at the first time-point at 
which they do not comply to the regime. The contribution of the noncom- 
pliant trajectories is redistributed among compliant trajectories that have 
the same covariate and treatment history, using the inverse-probability-of- 
censoring weighting. Advantages and disadvantages of this approach com- 
pared to the backward recursion approach mentioned above are discussed 
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in Robins, Orellana and Rotnitzky (2008), Section 5. We note that it is as- 
sumed in their approach that the length of each stage is fixed, an assumption 
we do not require. 

This general issue is also related to the analysis of two-stage randomized 
trials involving right-censored data studied in a series of papers including 
Lunceford, Davidian and Tsiatis (2002), Wahed and Tsiatis (2006), Wahed 
(2009), Miyahara and Wahed (2010). The authors use inverse-probability-of- 
censoring to correct for censoring. See also Thall et al. (2007) that considers 
analysis of two-stage randomized trials with interval censoring. However, the 
main focus of these works is in finding the best regime from a finite num- 
ber of optional regimes, as opposed to the individualized-treatment policies 
addressed in our proposal. 

3. Preliminaries. In this section we present definitions and notation which 
will be used in the paper. 

Let T be the maximal number of decision time-points for a given multi- 
stage time-dependent decision problem. Note that the number of stages for 
different observations can be different. For each t = 1, . . . ,T, the state St is 
the pair St = (Zt,Rt-i), where Zt is either a vector of covariates describing 
the state of the patient at beginning of stage t or Zt = 0. Zt = indi- 
cates that a failure event happened during the tth stage which has therefore 
reached a terminal state. Rt-i is the length of the interval between decision 
time-points t — 1 and t, where we denote Ro = 0. Although in the usual Q- 
learning context Xw=i Rj ^ s the sum of rewards up to and including stage t, 
in our context it is more useful to think of this sum as the total survival 
time up to and including stage t. Let At be an action chosen at decision 
time t, where At takes its values in a finite discrete space A. 

The model assumes that observations are subject to censoring. Let C be 
a censoring variable and let Sc(x) = P(C > x) be its survival function. We 
assume that censoring is independent of both covariates and failure time. 
We assume that C takes its values in the segment [0,r] where r < oo and 
that Sc(t) > K m \ a > 0. Let 5t be an indicator with 5t = 1 if no censoring 
event happened before the (t + l)th decision time-point. Note that 5t~i = 
0^d t = 0. 

Remark 3.1. Note that for a censoring variable, we define the survival 
function Sc(x) as P(C > x) rather than the usual P(C > x). This is be- 
cause given a failure time x, we are interested in the probability P{C > x). 
However, to avoid complications that are not of interest to the main results 
of this paper, we assume that the probability of simultaneous failure and 
censoring is zero [see, e.g., Satten and Datta (2001)]. 

The inclusion of failure times in the model affects the trajectory structure. 
Usually, a trajectory is defined (2T + l)-length sequence {Si, A\, S2, • • • , 
At, St+i}- However, in our context, if a failure event occurs before decision 
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time-point T, the trajectory will not be of full length. Denote by T the 
(random) number of stages for the individual (T <T). Due to the censoring, 
the trajectories themselves are not necessarily fully observed. Assume that 
a censoring event occurred during stage t. Note that this means that St-i = 1 
while St = and that C < Yll=i-Ri- ^ n this case the observed trajectories 
have the following structure: {S±, A±, S2, • • • , At} and C is also observed. 

We now discuss the distribution of the observed trajectories. Assume 
that n trajectories are sampled at random according to a fixed distribution 
denoted by Po. The distribution Pq is composed of the unknown distribution 
of each St conditional on (Sf_i, A t „i) (denoted by {/1, . . . , /t}) and an ex- 
ploration policy that generates the actions. Denote the exploration policy by 
p = {pi, . . . ,pt} where the probability that action a is taken given history 
{S t , A 4 _i} is pt(a\St, At_i). We assume that pt(a\st, a.t-1) > L~ l for every 
action a £ A and for each possible value (st, at— 1), where L > 1 is a constant. 
The likelihood (under Po) of the trajectory {s±, a±, S2, ■ ■ ■ ,at, Sf+i} is 
t 

]^(/ J (s J |s i _i,a i _i)p i (a jf |sj,a i _i))/ F+1 (s t+ i|s F , a f ). 
i=2 

We denote expectations with respect to the distribution Po by Eq. The 
survival function with respect to the distribution Po is denoted by G(x) = 
Po(^J =1 Pj > x). We assume that G(t) > G mm > 0, that is, that there is 
a positive probability that the survival time is greater than r. 

We define policy ir to be a sequence of deterministic decision rules, {iri , . . . , 
ttt}, where for every nonterminating pair (sj,at_i), the output of the ith 
decision rule, 7Tt(st, &t-i), is an action. Let the distribution Po j7r denote the 
distribution of a trajectory for which the policy 7r is used to generate the ac- 
tions. The likelihood (under Po l7r ) of the trajectory, {s\, 01, S2, ■ ■ ■ , at, Sf+i} is 
t 

/l(si)l7r(si)=ai X^ifjisMj-U^j-l^j^j^^a^ft+list+l^U^t)- 

i=2 

Our goal is to find a policy that maximizes the expected rewards. Since 
with probability 1 C < r, the maximum observed survival time is less than 
or equal to r. Thus we try to maximize the truncated-by-r expected survival 
time. Formally, we look for a policy n that approximates the maximum over 
all deterministic policies of the following expectation: 




where Po,vr is the expectation with respect to Po )7r and a A 6 = min{a, b}. 



4. The auxiliary problem. In this section we construct an auxiliary Q- 
learning model for our original problem. The modified trajectories of the 
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construction are of fixed length T, and the modified sum of rewards is less 
than or equal to r. We then show how results obtained for the auxiliary 
problem can be translated into results regarding the original problem. 

For the auxiliary problem, we complete all trajectories to full length in 
the following way. Assume that a failure time occurred at stage t <T. In 
that case the trajectory up to St+i is already defined. Write S'j = Sj for 
1 < j < t+1 and A'j = Aj for 1 < j < t. For all t+1 < j < T+1 set Sj = (0, 0) 
and for all t + 1 < j < T draw Aj uniformly from A. 

We also modify trajectories with overall survival time greater than r in the 
following way. Assume that t is the first index for which X^i=i Ri — T - For an 
j < t, write S'j = Sj and A'j = Aj . Write R' t = r — Ylt=i Ri an d assign Z' t+l 
7: and thus the modified state S' t+1 = (0, R' t ). If t < T, then for all t+1 < j < 
T+1 set Sj = (0,0) and for all t + 1 < j < T draw A'j uniformly from A. The 
modified trajectory is given by the sequence {5^, A^, S^j • • • , A^, Srp_^_^ . Note 
that trajectories with fewer than IT + 1 entries and for which X^=i ^» — r 
are modified twice. 

The n modified trajectories are distributed according to the fixed dis- 
tribution P which can be obtained from Pq. This distribution is composed 
of the unknown distribution of each S^ conditional on (S' t _ 1 ,A' t _ 1 ), denoted 
by {/{, . . . , and exploration policy p'. The conditional distribution f[ 

equals fi, and for 2 < t < T + 1, 

/i( s tl s t-l> a I-l) 

f 

i=l 

t 

ft((z'ti r t)\st~i^t~i) dr t, z' t _ 1 ^0,^2r' i = r, 

i=l 

i 1 sJ = (0,O), z t-l = ' 

where G z ' t = {(z' t , rj) : Xw=i r i — r l an d 1a is 1 if A is true and is otherwise. 
The exploration policy p' agrees with p on every pair (S^, A^_i) for which 
Zt^0 and draws At uniformly from A whenever Zt = 0. The likelihood 
(under P) of the modified trajectory, {s[, a[, s' 2 , . . . , a' T , s' T+1 }, is 

T 

/i(si)m(ail4)n(^'( s *l s t-i' a ^ 

t=2 

Denote expectations with respect to the distribution P by E. 

Let 7r be a policy for the original problem. We define a version of the pol- 
icy it' for the auxiliary problem in the following way. For any state (s^a^) 
for which z' t 7^ 0, the same action is chosen. For any state (sj,a^_ 1 ) for 
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which z[ = 0, a fixed action at € A is chosen; w.o.l.g., let a Q be chosen. For 
the auxiliary problem, we say that two policies Tr' a and n' b are equivalent if 
ir f a (s' t ,a' t _ 1 ) = 7r^(sj, a^_ x ) for every (sj,a^_ 1 ) for which z[ / 0. We denote 
both the original policy and any modified version of it by tt whenever it 
is clear from the context which policy is considered. Similarly, we omit the 
prime from states and actions in the auxiliary problem whenever there is no 
reason for confusion. 

Let P n be the distribution in the auxiliary problem where actions are cho- 
sen according to tt. The likelihood under P w of the trajectory {s%, a±, S2, ■ ■ ■ , 
a T ,s T +i} is 

T 

/l(si) 1 7ri( Sl )=ai n^^ S *l S *- 1 ' a *- 1 ) 1 ^'( s ^ a t-i)= a *)^+l(' s T+l|sT,aT). 
t=2 

Denote expectations with respect to the distribution P v by E v . 

We now define the value functions and the Q-functions for policies in the 
auxiliary model. For any auxiliary policy tt define its corresponding value 
function V n . Given an initial state s\, V 7T (si) is the expected truncated-by-r 
survival time when the initial state is si and the actions are chosen according 
to the policy tt. Formally V JT (si) = -EVElli Rt\Sl = si] where the truncation 
takes place since the expectation is taken with respect to the distribution 
of the modified trajectories. The stage-i value function for the auxiliary 
policy 7r, V nt t( s tj a i-i)i is the expected (truncated) remaining survival time 
from the tth decision time-point, given the trajectory (s^,at_i), and when 
following the policy tt thereafter. Note that given s^, the survival time up 
to the beginning of stage t is known, and thus truncation ensures that the 
overall survival time is less than or equal to r. Formally V^^St, a^_i) = 

The stage-t Q-function for the auxiliary policy it is the expected remaining 
(truncated) survival time, given that the state is (st,aj_i), that at is chosen 
at stage t, and that ir is followed thereafter. Formally, 

Q-7r,t( s t) a t ) = E[R t + K,t+i( s t+i, A t ) | S t = s t , At = a t ]. 

The optimal value function V t *(st, at_i) and the optimal Q-function Q\ (st, at) 
are defined by (2) and (3), respectively. 

The following lemma relates the values of the value function in the 
auxiliary problem to the expected truncated-by-r survival time for a policy tt 
in the original problem. 

Lemma 4.1. Let II be the collection of all policies in the original prob- 
lem. Then for all ir S II, the following equalities hold true: 



(6) V n ( So )=E { 



0.7T 




At 
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(7) 



V*(s ) = maxEb.Tr 
vren 



' T \ 

I> A 



Si = S 



where and V* are value functions in the auxiliary problem. 

Proof. We start by decomposing the expectations depending on both the 
terminal stage and whether the sum of rewards is greater than or equal to r. 
Define 

Ft = j{s ,ai, . . . , s t+ i} :'Y]ri < r,z t+ i = j, 

G t = |{s ,ai, . . .,s k+ x}:t = minjj : E r * - r |' and k = T or Zk+1 = j' 
F[ = {(st+i , *r) : aj) € F t , . . . , s T +i} = {a Q , (0,0),..., (0, 0)}}, 

G' t = 1{s't+i,3-t) '■ ( s *> a t) i s a beginning of sequence in Gt, 

{s' t+1 , a' t+1 , s T+1 } = | ^0, t - r j^j , a Q , • • • , (0, 0) 1 1 . 



Denote 



fi,7r(Si,at_i) 



i-l 



= /l( s l)[ 1 7r(si)=ai] 11 (/j (»i I S J-1 ) a J-l) l-TTi ( Si )=oy ! 

x /t(s t |s t _i,a t _i) 

and similarly 



Note that 



-^0,71 



(8) 



and 



(9) 



T 



Ys Rt A 



Si = s c 



U=l 



Yl / ft+i,7r(st+i,a t )d(s t+ i,a t ) 

T 

+ r£l%, jr (G t ) 



K(s c ) = V / Vri f^ +1 (s T+ i,a T )d(s T+ i,a T ) 
t=1 JFl\ i=1 J 

+ rY J P,{G' t ). 



t=i 
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Note that 

(10) = f F ^13 r ^ f t+l,T( S *+ 1 ' a *) d ( S t+ 1 ' a *) 

= f pi \Y1 { t+iA s t+i . a r) d(s T+ i, a T ), 

where the first equality follows from (5) and the second follows since there 
is a one-to-one correspondence between trajectories in Ft and F(, and by 
construction, for each such trajectory in F[ we have Y2i=t+i r « = an< l 

T 

[ 1 7r t+ i(s t+ i,a t )=aJ (/j( S il S J-l' a J-l)l^(s J ,a J - 1 )=aJ/T+l(sT+l| s T, a T ) = 1. 

j=t+2 

Similarly, we show that Po >n (Gt) = P 7T (G' t ). Denote by Gt the set of all 
sequences (sj,at) which are the beginning part of some trajectory in Gt- 
Note that 

Po,w{G t )= f t (s t ,a t _i)[l 7rt(stjat _ l)=a J 
JG t 

x / / i+ i(s t+ i|s t ,a i )d(s i+ i)d(s t ,a i ) 

(11) = / f t(st,at-l)[ 1 7r t (s t ,a t _i)=aJ 

J Gt 

x / //+i(st+i|st,a t )d(s i+ i)d(s t ,a i ) 

= / fT+i( s T+i,a t )d(s T+ i,a t ) = P 7r (G' t ) 
Jg' 



>G't 

where the second equality follows from (5) and the third equality follows 
from the construction of G' t . 

The first assertion of the lemma, namely, (6), follows by substituting the 
right-hand side of the equalities (10) and (11) in (8) for each t and comparing 
to (9). 

The second assertion, (7), is proven by maximizing both sides of (6) over 
all policies. Note that the maximization is taken over two different sets since 
each policy in the original problem has an equivalent class of policies in the 
auxiliary problem. However, since V n is the same for all policies in the same 
equivalence class, the result follows. □ 
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5. The censored- Q-learning algorithm. We now present the proposed 
censored- Q-learning algorithm. As discussed before, we are looking for a pol- 
icy 7r that approximates the maximum over all deterministic policies of the 
following expectation: 



n,7T 



J> At 



,t=i 



We find this policy in three steps. First, we map our problem to the corre- 
sponding auxiliary problem. Then we approximate the functions {Ql, ■ ■ ■ , Q^p} 
using backward recursion based on (3) and obtain the functions {Qi, ■ ■ ■ , Qt}- 
Finally, we define n by maximizing Qt(st, (at_i,at)) over all possible ac- 
tions at- 

Let {Qi, . . . , Qt} be the approximation spaces for the Q-functions. We 
assume that Qt(st,&t) = whenever z% = 0. In other words, if a failure 
occurred before the ith time-point, Qt equals zero. 

Note that by (3), the optimal t-stage Q-function Q|(st,at) equals the 
conditional expectation of Rt + max at+1 Q%+\ (St+i , (A t ,a t+ i)) given (s t ,at). 
Thus 

Q t * = argmin£; ( R t + max (S t+1 , (A t , a t +i)) - Qt(S t , A t )) . 
Qt LV ai +! J J 

Ideally, we could compute the functions Qt using backward recursion in 
the following way: 

Q t = argminE n (R t + maxQ m (S m , (A f , a t+ i)) - Qt(S t , A t )) , 
Qt LV at + 1 J J 

where E n is the empirical expectation. The problem is that Rt may be 
censored and thus unknown. 

Note that E[5 t \Y* =l Ri] = P(C > E*=i Ri) = Sc(El=i Ri) an d thus 



E 



St 



St , A t , R t 



.Sc(Y,i=xRi) 

since St includes the information regarding R\ , . . . , 
dent of the covariates and actions. 
Thus, for every function Qt € Qt, 



1 



Rt-\ and C is indepen- 



E 



R t +max<3t +1 (Sf + i, (A t ,a t+ i)) -Q t (S t ,A t ) 



Ot+l 



E 



R t + maxQt +1 (S t+ i, (A t ,a t+ i)) - Q t (S t ,A t 

at+i 
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(12) 



E 



E 



[Rt + maxQ£ + i( s t+i: (A t ,a t+ i)) - Q t (S t ,A t ) 

St 



St, At, Rt 



= ^[(fit + maxQ,V(S m ,(At,at + i))-Qt(St,At)) 2 -§ 

Since Q| is the minimizer of the first expression in the above sequence 
of equalities, it also minimizes the last expression. Thus, we suggest to 
choose Qt recursively as follows: 

R t + max(5t+i(St+i, (A t ,a t+ i)) - Q t (S t ,A t ) 



(13) 



arg min E n 

Qt&Qt 



a.t+1 



where we define Qt+i = 0, and Sc is the Kaplan-Meier estimator of the 
survival function of the censoring variable Sc- Note that by Remark 3.1, 
the Kaplan-Meier estimator at x needs to estimate P{C > x) rather than 
P(C > x). This can be done by taking a right continuous version of the 
Kaplan-Meier estimator that interchanges the roles of failure and censoring 
events for estimation [see Satten and Datta (2001)]. 

We define the policies frt using the approximated Q-functions Qt as fol- 
lows: 

7Tf(st,a t _i) = argmaxQt(st, (at_i,ctt)). 

at 

6. Theoretical results. Let {Q\,...,Qt} be the approximation spaces 
for the minimization problems (13). Note that we do not assume that the 
problem is Markovian, but, instead, we assume that each Qt is a function 
of all the history up to and including stage t. Hence the spaces Qt can be 
different over t. 

We assume that the absolute values of the functions in the spaces {Qt}t 
are bounded by some constant M. Moreover, we need to bound the complex- 
ity of the spaces {Qt}t- We choose to use uniform entropy as the complexity 
measure [see van der Vaart and Wellner (1996)]. This enables us to obtain 
exponential bounds on the difference between the true and empirical expec- 
tation of the loss function that involves a random component, namely, the 
Kaplan-Meier estimator, as in (13) (see Lemma A. 6). This is different from 
Murphy (2005b) who uses the covering number as a measure of complexity 
[Anthony and Bartlett (1999), page 148] for the squared error loss function. 
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For every e > and measure P, we denote the covering number of Q 
by N(e,Q,L2(P)), where N(e, Q, L2(P)) is the minimal number of closed 
L2(-P)-balls of radius e required to cover Q. The uniform covering number 
of Q is defined as supp N(eM, Q,L2(P)) where the supremum is taken over 
all finitely discrete probability measures ? on Q. The log of the uniform 
covering number is called the uniform entropy [van der Vaart and Wellner 
(1996), page 84]. We assume the following uniform entropy bound for the 
spaces {Q t }: 



for all < e < 1 and some constants < W < 2 and D < oo, where the 
supremum is taken over all finitely discrete probability measures, and M is 
the uniform bound defined above. 

In the following, we prove a finite sample bound on the difference between 
the expected truncated survival times of an optimal policy and the policy tt 
obtained by the algorithm. As a corollary we obtain that the difference 
converges to zero under certain conditions. 

The proof of the theorem consists of the following steps. First we use 
Lemma 4.1 to map the original problem to the corresponding auxiliary one. 
Second, for the auxiliary problem, we adapt arguments given in Murphy 
(2005b) to bound the difference between the expected value of the learned 
policy and the expected value of the optimal policy using error terms that 
involve expectations of both the learned and optimal Q-functions. Third, we 
bound these error terms by decomposing them to terms that arise due to the 
difference between the empirical and true expectation, terms that arise due 
to the differences between the estimated and true censoring distribution, 
and terms related to the empirical difference between the estimated and 
optimal Q-function. Fourth, and finally, we obtain a finite sample bound 
which depends on the complexity of the spaces {Qt}, the deviation of the 
Kaplan-Meier estimator from the censoring distribution, and the size of the 
empirical errors in (13). 

Theorem 6.1. Let {Qi, . . . ,Qt} be the approximation spaces for the 
Q-functions. Assume that the uniform entropy bound (14) holds. Assume 
that n trajectories are sampled according to Pq. Let % be defined by (4)- 

Then for any < r] < 1, we have with probability at least 1 — tj, over the 
random sample of trajectories, 




) 



(14) 
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T T , 

(15) < Voe + L t/2 Y ( 2LH j ^K n 



t=i 



3=t 



x(F(Q t ,Q t+1 )-F(Ql,Q t+1 )) 

for all n that satisfies 
( 5T 

max< — exp{— nCie 4 + -v/nC^e 2 }, TC3 exp{— 2ne 4 + C^^fne 



1/2 



n 



where 

F(Qt,Qt+x) = ( Rt + max Q t+ i(S t+ i, A t , a t+ i) - Q t 1 

C X = 2(1 - G min ) 2 Afr 2 < in (4L)- 2 ( T+1 ), 
C 2 = C Q {\ - G min ) Mf 1 ( 4L ) ~ 1 \ 
C 3 = C a exp{(4L)-( T+1 )}, 

C, = C b {AL)^l\ 

and where M\ = (2M + r) 2 , C Q is the constant that appears in Bitouze, Lau- 
rent and Massart [(1999), equation (1)], C a , C b and U are the constants that 
appear in Lemma A. 6, and for some a small enough such that U + a Q < 2. 

Before we begin the proof of Theorem 6.1, we note that the bound (15) 
cannot be used in practice to perform structural risk minimization [see, e.g., 
Vapnik (1999)] for two reasons. First, the bound itself is too loose [see also 
Murphy (2005b), Theorem 1, Remark 4]. Second, the constants, such as C a 
and Cft, are not given, and are model-dependent. Interestingly, a bound on C 
was established recently by Wellner (2007). However, this bound is large and 
simulations suggest that it is not tight. The bound (15) can, however, be used 
to derive asymptotic rates [Steinwart and Christmann (2008), Chapter 6]. 
Moreover, when the functions Qf are in Qt, we obtain universal consistency, 
as stated in the following corollary: 

Corollary 6.2. Assume that the conditions of Theorem 6.1 hold. As- 
sume also that for every t, Qt £ Qt. Then 



SUP £ ,7T 

Tren 




At 




At 



0. 



Proof. Note that for every t, Qt is the minimizer of 

St 



IE, 



-F(Q t ,Qt+i) 
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Hence, the second expression in the right-hand side of (15) equals zero, and 
the result follows. □ 



Proof of Theorem 6.1. By Lemma 4.1, 



SUP £t),7T 

Tren 



T 



,t=i 



E, 



0,7T 



T 



,i=l 



£[K*csi)-y#(Si)], 



where the expectation on the right-hand side of the equality is with respect 
to the modified distribution P. 

By Lemma 2 of Murphy (2005b) and Remark 2 that follows, for every 
state s £ Si , 



^*(s ) - F # ( So ) < Y,2L t/2 \jE[(Q t (S t , A t ) - Q*(S t , Ai)) 2 |,Si = s \. 



t=i 

Applying Jensen's inequality, we obtain 

T 



(16) E[V*{S{) - V^Si)] <^2L*/yS[(Q t (S t , A t ) - Q t *(S t , A t )) 2 ]. 



i=l 

We wish to obtain a bound on the expression E[(Qt{St, A t ) — Q*(St, A t )) 2 ] 
using the expressions Err^ (Qt) — Err^ {Qt)t where 

Err Qt+i(Qt) = E (Rt + maxQ t +i(St+i, A t ,a t+ i) - Q t (S t ,A t )) 

for any pair of function Qt and Qt+i- To obtain this bound we follow the 
line of arguments that leads to the bound in equation (13) in the proof of 
Theorem 1 of Murphy (2005b). The bound (19) obtained here is tighter 
since only the special case of Q* in the second Err function is considered. 
To simplify the following expressions, we write Qt instead of Qt(St,A t ) 
whenever no confusion could occur. 
For each t, 



= E[Ql)-E[(Q* t f] 
+ 2E 



R t + max Qt+i (S t+ i , A t , a t+ i) (Q£ - Qt j 

at+i 



E[Q 2 t ]-E[(Q* t 
+ 2E 



(Q* t - Q t )E R t - maxQ* +1 (S t+ i, A u a t+1 ) |S*, A t 

Ot+l 
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+ 2E 



( max Q* t+l (S t+ i , A t , a t+ i ] 



V a.t+1 



(17) 



- maxQi + i(S t+ i, A t ,a t+ i) )(Qt - Qt) 

at+i / 

: E[Q 2 } - E[(Q* t ) 2 ] + 2E[(Q* t ) 2 ] - 2E[Q t Q* t ] 
2E (maxQt +1 (St+i,A t ,a t+ i) 

\a t +i 

- maxQ i+1 (S t+ i, A t ,a t+ i) - Qt) 

a.t+1 / 



■E[{Q t -Qlf] 
+ 2E 



(maxQj+^Si+i, A t ,o t+ i) 
V a t +i 

- m&xQt + i(St + i,A t ,a t+ i))(Qt - Qt] 

a.t+1 / 



where the second to the last equality follows since 

Qt(st,a. t ) = E ( R t - max Qt +1 (S t+ i, A t , a t+1 ))\S t = s t , A t = a t 



o.t+1 



Using the Cauchy-Schwarz inequality for the second expression of (17), 
we obtain 

Err Qt+i(Qt)- Err Qt+i(Qt) 



>E[(Q t -Q*t) 2 } 



2E 



(m&xQ* t+1 (S t+ i,At,a t+1 ) - maxQ m (S m , A t ,a t+ i) 



V a t +i 



at+i 



1/2 



xE[(Q* t -Q t ) 2 ] 1/2 . 



Note that 
E 



(max Qt +1 (St+i ,A t ,a t+ i)- max Q t +i (S t +i , A t , a t+ i) 

\at+i at+i 

< E max(Q* +1 (S m , A t ,a t+ i) - Q t +i(S t +i,A t ,a t+1 )) 



(18) 



l a.t+1 



<E 



L ^2(Qt+i( s t+i,A t ,a) - Q t+ i(S t+1 , A t ,a)) 2 p t {a\S t+ i, A t ) 
aeA 



= LE[(Q* t+1 (St+i, A t+1 ) - 4+1 (St+i , A m )) 2 ] , 

where the first inequality follows since (max a h(a) — max a h'{a)) 2 < 
max a (/i(a) — h'(a)) 2 and where L is the constant that appears in the defini- 
tion of the exploration policy p (see Section 3). 
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Using inequality (18) and the fact that xy < \(x 2 +y 2 ), we obtain 
Err Qt+i (Q t ) - Err$ t+i (Q* t ) 

> E[(Q t - Q* t f] - E[4L(Q* t+1 - Q t+l f] 1/2 E[{Q* t - Qt) 2 ] 1 ' 2 

> \E[(Q t - Ql) 2 } - 2LE[(Q* t+1 - Q t+1 ) 2 ]. 



Hence 



E[(Q t - Qt) 2 ] < 2(ErTQ t+i (Q t ) - Err^JQD) + ALE[(Q* t+1 - Q m ) 2 ]. 
Using the fact that Qt+i = Qt+i = Oj we obtain 



T 



(19) E[(Q t - Q* t ) 2 \ <2Y,(4Ly- t (Err . +i (Q j ) - Err^JQ*)). 

3=t 

We are now ready to bound the expressions Errq (Qj) — EtTq. x (Qp- 
For any Q t G Qt U Qj, Qt+l £ St+lj and censoring survival function if: 
[0,r] H» [ifmin, 1], where -K" m i n > 0, define 



£(Q t ,Q t+1 ,K) 



E 



(20) 



£n{Qt,Qt+l,K) 



a'(e:=i^) v 



Rt + max Qt+i (S m , A t ,a t+ i) - Qt 

Ot+l 



i?t + max Qt+i (St+i, At, at+i) - Q< 



Note that similarly to (12) we have Err a. (Q t ) = £(Qt,Qt+i,Sc), where Sc 
is the censoring survival function. 
Using this notation, we have 

= £(Qt,Qt+i,Sc) -£{Ql ,Qt+i,Sc) 

< \£(Q t , Qt+i,S c ) - £{Q t ,Qt+uSc)\ 
+ \£(Qt, Qt+i,Sc) — £ n (Qt,Qt+i,Sc)\ 
+ (£ n (Qt,Qt+i,Sc) — £ n (Q$,Qt+i,Sc)) + 
+ \£n(Qt,Qt+uSc) —£(QhQt+i,Sc)\ 
+ \£(Q* t ,Qt+i,Sc)-£(Q*t,Qt+uSc)\, 
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where Sc is the Kaplan-Meier estimator of Sc, and (a)+ = max{a, 0}. Hence 

<2 sup \£(Q t ,Qt+i,S c ) - £(Qt,Qt+i,S c )\ 

, x {QuQt+i} 
(21) 

+ 2 sup \£(Q t ,Q t+1 ,K)-£ n (Q t ,Q t+1 ,K)\ 
{Qt,Qt+i,K} 

+ (£ n (Qt,Qt+i,Sc) - £n{Q*t ,Qt+l,Sc)) + - 
Combining (19) and (21), and substituting in (16), we have 
£?[7*(5i)-^(5i)] 

T T 



< 2^L'/ 2 ^ yj2(4Ly-t(Err Qt+1 (Q t ) - Err Qt+1 {Q*)) 
t=i j=* 



(22) <8(4L)( T+1 )/ 2 < /max sup |£(Q t , Q m , S c ) - £ (Qt, Qt+i, S c )\ 

1 {Qt,Qt+i} 



(23) +8(4L)( T+1 )/ 2 /max sup \£(Q t ,Q t+1 , K) - £ n (Q t ,Q t+1 , K) 

1 {Qt,Qt+i,K} 



+ 2Y,L t/2 2^*^/2^2(^(4, 4+1, Sc) " £n(Q t *» 4+1, &?))+, 
t=l j=t 

where we used the fact that ELi L * /2 Ej=t( 4L ) (j ~ <)/2 < 2(4L)( T+1 )/ 2 for 
L > 2 and the fact that sfx + y < \fx + yjy. 

In the following, we replace the bounds in (22) and (23) with exponential 
bounds. We start with (22). Note that (R± + max at+1 Qt+±(St+i, At, at+i) — 
Qt) 2 < Mi = (2M + r) 2 for all Q t , Q t+1 . Hence, 

sup |5(Q t ,Q T+1 ,5 c ) -5(Qi,Qr+i,5 c )| <M X K^E[\S C - Sc\\ 
{Qt,Qt+i} 

and thus 



P (4L) T / 2+2 /max sup |£(Q t , Q T +i, S c ) - £(Qt, Qt+i, S c )\ > e 
v V {Qt,Qt+i} 

T 

(24) <Vp((4L) t / 2+2 / sup |5(Q t , Qt+i, Sc)-f(Qt, Qt+i, 5 c )|>e 
t=i v V {Qt,Ot+i} 



< TP((4L) T / 2+2 JM.K^JS - Sc\\oo > e), 
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where the first equality follows from the fact that 

T 

(25) P( max X t > c) < Vp(JT ( > c). 

Using a Dvoretzky-Kiefer-Wolfowitz-type inequality for the Kaplan-Meier 
estimator [Bitouze, Laurent and Massart (1999), Theorem 2], we have 

P(\\Sc-So\\oo>e') 

(26) 

< f exp{-2n(l - G min ) 2 (e') 2 + Cov^U - G min )e'}, 

where C Q is some universal constant and G m j n is a lower bound on the 
survival function at r (see Section 3). 



Write e = (4L)( T+1 V 2 J MiK~? n s\ and thus e' = Mf 1 ^ in e 2 (4L)-( T+1 ) . 



Note that 8(4L) T / 2 + 2 ^M 1 ^- i 2 n ||5 c; - S c ||oo > 8e iff ||S C - S c \\oo > e' . Ap- 
plying the inequality (26) to the right-hand side of (24) and substituting 
for e, we obtain 

P(8(4L)( T+1 )/ 2 /sup sup \£(Q t ,Q T+u S c )-£(Qt,QT+i,Sc)\>8e) 
v V * {Qt,Qt+i} J 

< ^exp{-2n(l -G min ) 2 M 1 - 2 < in e 4 (4L)- 2 ( T + 1 ) 

(27) 

+ C oV ^(l - G min )M 1 - 1 K min e 2 (4L)-( T + 1 )} 

= ^ exp{-nCie 4 + V™C2£ 2 }, 

where d = 2(1 -G min ) 2 M^Kf niQ (4L)~ 2 ^ and C 2 = C (l -G min )Mf 1 x 
^ min (4i)- (T+1) . 

We now find an exponential bound for (23). We follow the same line of ar- 
guments, replacing the Dvoretzky-Kiefer-Wolfowitz-type inequality used in 
the previous proof with the uniform entropy bound. Recall that by assump- 
tion, the uniform entropy bound (14) holds for the spaces Qt and thus also 
for the spaces QtUQ*- Hence, by Lemma A. 6, and (25), for W' = max{W, 1} 
and for all a > 0, we have 

pf8(4L)( T+1 )/ 2 /max sup \£(Q t ,Q t+1 , K) - £ n (Q u Q t+1 , K)\ > 8e 



{Qt,Qt+i,K} 

(28) < TC a exp{C7 fe ^(4L)- {T+1)/2 e 2(C/+a) - 2n(4L)"( T+ V} 
= TC 3 exp{C4^£ 2(C/+Q) - 2ne 4 }, 

where C 3 = C a exp{(4L)-( T+1 ) }, C 4 = C b (4L)( T+1 )/ 2 and U = W'(6 - W')/ 
(2 + W). 
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Take n large enough such that the right-hand sides of (27) and (28) are 
less than 77/2 and substitute in (22) and (23), respectively, and the result of 
the theorem follows. □ 

7. Simulation study. We simulate a randomized clinical trial with flexi- 
ble number of stages to examine the performance of the proposed censored- 
Q-learning algorithm. We compare the estimated individualized treatment 
policy to various possible fixed treatments. We also compare the given ex- 
pected survival times of different censoring levels. Finally, we test the effect 
of ignoring the censoring. 

This section is organized as follows. We first describe the setting of the 
simulated clinical trial (Section 7.1). We then describe the implementation 
of the simulation (Section 7.2). The simulation results appear in Section 7.3. 

7.1. Simulated clinical trial. We consider the following hypothetical can- 
cer trial. The duration of the trial is 3 years. The state of each patient at each 
time-point u £ [0, 3] includes the tumor size [0 < T(u) < 1], and the wellness 
[0.25 < W(u) < 1]. The time-point u Q such that W(u ) < 0.25 is considered 
the failure time. We define the critical tumor size to be 1. At time U{ such 
that T{ui) = 1, we begin a treatment. We call the duration [iij,Uj+i] the ith 
stage. Note that different patients may have different numbers of stages. 

At each time-point Uj, we consider two optional treatments: a more ag- 
gressive treatment (A), and a less aggressive treatment (B). The immediate 
effects of treatment A are 



that is, the wellness at time Ui after treatment A [denoted by W(ti?~|.A)] 
decreases by 0.5 wellness units. The tumor size at time Ui after treatment A 
[denoted by T(uf\A)] decreases by a factor of l/(10W(ui)) which reflects 
a greater decrease of tumor size for a larger wellness value. Similarly, the 
immediate effects of the less aggressive treatment B are 



which, in comparison to the treatment A, has lower effect on the tumor size 
but also lower decrease of wellness. The wellness and tumor size at time 
U{ <u< Ui + \ follow the dynamics 



(29) 



W(uJ\A) = W{m) -0.5, 
T(ut\A)=T( Ui )/(WW(u t )) 



(30) 



W(uf\B) = W(ui) -0.25, 
T(ut\B) = T(u i )/(W(u i )) 



(31) 



W{u) = W{uf) + (1 - W{uf)){\ - 2^ u ~ u ^/ 2 ) 
T(u)=T(u+)+AT( U +)( u - Ui )/3. 
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The stage that begins at time-point 114 ends when either T(iii+i) = 1 for 
some Ui < Ui+i < 3 or when a failure event occurs or at the end of the trial 
when u = 3. During this stage, we model the survival function of the patient 
as an exponential distribution with mean 3(W(uf) + 2)/20M(m^~). 

The trajectories are constructed as follows. We assume that patients are 
recruited to the trial when their tumor size reaches the critical size, that 
is, for all patients T(0) = 1, and hence u\ = is the beginning of the first 
stage. The wellness at the beginning of the first stage, W(0), is uniformly 
distributed on the segment [0.5, 1]. With equal probability, a treatment a\ € 
{A, B} is chosen. If no failure event occurs during the first stage, the first 
stage ends when either T(u2) = 1 for some = u\ < 112 < 3 or at the end of 
the trial. If the first stage ends before the end of the trial, then with equal 
probability another treatment 02 £ {A B} is chosen. The trial continues in 
the same way until either a failure time occurs or the trial ends. We note 
that the actual number of stages for each patient is a random function of the 
initial state and the treatments chosen during the trial. Due to the choices 
of model parameters, the number of stages in the above dynamics is at least 
one and not more than three. 

For each trajectory, a censoring variable C is uniformly drawn from the 
segment [0, c] for some constant c > 3, where the choice of the constant c 
determines the expected percentage of censoring. When an event is censored, 
the trajectory (i.e., the states and treatments) up to the point of censoring 
and the censoring time are given. 

7.2. Simulation implementation. The Q-learning algorithm presented in 
Section 5 was implemented in the Matlab environment. For the implemen- 
tation we used the Spider library for Matlab. 2 The Matlab code, as well as 
the data sets, are available online [see Goldberg and Kosorok (2012)]. 

The algorithm is implemented as follows. The input for the algorithm 
is a set of trajectories obtained according to the dynamics described in 
Section 7.1. First, the Kaplan-Meier estimator for the survival function of 
the censoring variable is computed from the given trajectories. Then, we set 
Q4 = and compute Qi, i = 3, 2, 1 backwardly, as the minimizer of (13) over 
all the functions Qi(si,ai) which are linear in the first variable. The policy tt 
is computed from the functions {Qi, Q2, Q3} using (4). 

We tested the policy fr = (-fri, %2, ^3) by constructing 1000 new trajec- 
tories, in which the choice of treatment at each stage is according to rr. 
One thousand initial wellness values were drawn uniformly from the seg- 
ment [0.5,1]. For each wellness value, a treatment was chosen from the set 
{A, B}, according to the policy it\. The immediate effect of the treatment 



2 The Spider library for Matlab can be downloaded from http : //www . kyb . tuebingen . 
mpg . de/bs/people/ spider/ . 
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was computed according to (29)-(30). A failure time was drawn from the 
exponential distribution with mean as described in the previous section; de- 
note this time by f\. The time that the tumor reached the critical size was 
computed according to the dynamics (31), and we denote this time by u%. If 
both f\ and U2 are greater than 3 (the end of the trial), then the trajectory 
was ended after the first stage and the survival time for this patient was 
given as 3. Otherwise, if /i < «2, the trajectory was ended after the first 
stage and the survival time for this patient was given as /i . If u-i < f\ , then 
at time U2, a second treatment is chosen according to the policy tt2- The 
computation of the remainder of the trajectory is done similarly. The ex- 
pected value of the policy n is estimated by the mean of the survival times 
of all 1000 patients. 

We compared the results of the algorithm to all fixed treatment sequences 
A1A2A3, where £ {A, B}. The expected values of the fixed treatment 
sequences were computed explicitly. We also compared the results to that of 
the optimal policy, which was also computed explicitly. 

7.3. Simulation and results. First, we would like to examine the influence 
of the sample size and censoring percentage on the algorithm's performance. 
We simulated data sets of trajectories of sizes 40, 80, 120, . . . , 400. For each 
set of trajectories we considered four levels of censoring: no censoring, 10% 
censoring, 20% censoring, and 30% censoring. Higher levels of (uniform) cen- 
soring were not considered since this requires drawing the censoring variable 
from a segment [0, c] for c < 3, which is in contrast to the assumption on the 
censoring variable (see the beginning of Section 3). A policy tt was computed 
for each combination of data set size and censoring percentage. The policy tt 
was evaluated on a data set of size 1000, as described in Section 7.2. We 
repeated the simulation 400 times for each combination of data set size and 
censoring percentage. The mean values of the estimated mean survival time 
are presented in Figure 1. A comparison between the different fixed policies, 
policies obtained by the algorithm for different censoring levels, and the 
optimal policy appears in Figure 2. As can be seen from both figures, the 
individualized treatment policies obtained by the algorithm are better than 
any fixed policy. Moreover, as the number of observed trajectories increases, 
the expected survival time increases, for all censoring percentages. 

We also examined the influence of the sample size and censoring percent- 
age on the distribution of estimated expected survival time. We simulated 
data sets of sizes 50, 100, 200, . . . , 3200 and we considered the four levels of 
censoring as before. As can be seen from Figure 3, the variance decreases 
when the sample size becomes larger. Also, the variance is smaller for smaller 
percentage of censoring, although the difference is modest. 

Note that the maximum expected survival times obtained by the algo- 
rithm are a little bit above 17 months (see both Figures 1 and 2), while the 
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Fig. 1. The solid black curve, dashed blue curve, dot-dashed red curve, and dotted green 
curve correspond to the expected survival time (in months) for different data set sizes with 
no censoring, 10% censoring, 20% censoring and 30% censoring, respectively. The expected 
survival time was computed as the mean of 400 repetitions of the simulation. The black 
straight line, blue dashed straight line, and the dot-dashed red straight line correspond to 
the expected survival times of the optimal policy, the best fixed treatment policy, and the 
average of the fixed treatment policies, respectively. 



value of the optimal policy is 17.85. The difference follows from the fact that 
the Q-functions estimated by the algorithm are linear while the optimal Q- 
function is not (see Figure 4). It is worth mentioning that even in the class 
of linear functions on which the optimization is done there are Q-functions 
that yield higher values. This fact is often referred to as the "mismatch" 



20 1 1 1 1 1 1 1 1 1 1 1 r 




AAA AAB ABA ABB BAA BAB BBA BBB 30% 20% 10% None Opt 



Fig. 2. The eight light gray bars represent the expected survival times for different fixed 
treatments where A1A2A3 indicates the policy that chooses Ai at the ith stage. The four 
dark gray bars represent the expected survival times for policy n obtained by the algorithm 
with no censoring, 10% censoring, 20% censoring and 30% censoring. The white bar is 
the expected value of the optimal policy. The values of the fixed treatments and the optimal 
policy were computed analytically while the values of it are the means of 400 repetitions of 
the simulation on 200 trajectories. 
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Fig. 3. Distribution of expected survival time (in months) for different data set sizes, with 
no censoring, 10% censoring, 20% censoring and 30% censoring. Each boxplot is based on 
400 repetitions of the simulation for each given data set size and censoring percentage. 



that follows from the fact that optimization of the value function is not per- 
formed explicitly, but rather through optimization of the Q-functions [see 
Tsitsiklis and van Roy (1996), Murphy (2005b), for more details]. 

Figure 5 shows the number of treatments that were needed for patients 
that followed the policy ft and did not have a failure event during the trial. 



Q f (W,A3 Q,(W,B) V(W) 




I | , | | , 

0.5 0.75 1 0.5 0.75 1 0.5 0.75 1 
Wellness Wellness Wellness 



Fig. 4. The Q-functions computed by the proposed algorithm for a size-200 trajectory set. 
The left panel presents both the optimal Q-function (solid red curve) and the estimated ex- 
junction (dashed blue curve) Jor different wellness levels and when treatment A is chosen. 
Similarly, the middle panel shows both Q-functions when treatment B is chosen. The right 
panel shows the optimal value function (solid red curve) and the estimated value function 
(dashed blue curve). 
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Fig. 5. 27ie number of required treatments for patients that follow the policy tt, when 
no failure event occurs during the trial. The policy tt was estimated from 100 trajectories. 
The results were computed using a size 100,000 testing set. 

As can be seen from this figure, patients with high initial wellness need only 
one treatment. On the other hand, patients with very low initial wellness 
value need three treatments. 

Finally, we checked the effect of ignoring the censoring on the expected 
survival time. We considered two ways of ignoring the censoring. First, we 
consider an algorithm that ignores the weights in the minimization prob- 
lem (13). This is equivalent to deleting the last stage from each trajectory 
that was censored. We also consider an algorithm that deletes all censored 
trajectories. In the example presented in Figures 1-5, where uniform cen- 
soring takes place, there is a relatively moderate difference between the 
expected survival time for the proposed algorithm and the other two algo- 
rithms that ignore censoring. However, when the censoring variable follows 
the exponential distribution (leaving fewer observations with longer survival 
times), the bias from ignoring the censored trajectories is substantial, as can 
be seen in Figure 6. 

8. Summary. We studied a framework for multistage decision problems 
with flexible number of stages in which the rewards are survival times and 
are subject to censoring. We proposed a novel Q-learning algorithm adjusted 
for censoring. We derived the generalization error properties of the algorithm 
and demonstrated the algorithm performance using simulations. 

The work as presented is applicable to real- world multistage decision prob- 
lems with censoring. However, two main issues should be noted. First, we 
assumed that censoring is independent of observed trajectories. It would 
be useful to relax this assumption and allow censoring to depend on the 
covariates. Developing an algorithm that works under this relaxed assump- 
tion is a challenge. Second, we have used the inverse-probability-of-censoring 
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Fig. 6. The solid blue curve, dashed black curve, and dot-dashed red curve correspond to 
the expected survival times (in months) for different data set sizes, for the proposed algo- 
rithm, the algorithm that ignores the weights, and the algorithm that deletes all censored 
trajectories, respectively. The censoring variable follows the exponential distribution with 
50% censoring on average. The expected survival time was computed as the mean of 400 
repetitions of the simulation. 



weighting to correct the bias induced by censoring. When the percentage 
of censored trajectories is large, the algorithm may be inefficient. Finding 
a more efficient algorithm is also an open question. 

APPENDIX: SUPPLEMENTARY PROOFS 

The main goal of this section is to provide an exponential bound on the 
difference between the empirical expectation £ n (Q t ,Qt+i,K) and the true 
expectation £(Qt, Qt+i, K) as a function of the uniform entropy of the class 
of functions [see (20)]. This result appears in Lemma A. 6. Similar results 
for Glivenko-Cantelli classes, Donsker classes and bounded uniform entropy 
integral (BUEI) classes can be found in van der Vaart and Wellner (1996) 
and Kosorok (2008). 

Lemma A.l. LetFi, . . ., J- k bek sets of functions. Assume that for every 
jE{l,...,k} ; suPfe^WfWoo^Mj. Let(t>:R k ^R satisfy 

k 

(32) \c/> o f(x) - d> o g(x)\ 2 < c 2 ^2(fj(x) - g,(x)) 2 

3=1 

for every f = (/i, . . . , f k ),g = (g u . . . ,g k ) G T\ X • ■ • X JF k , where < c < oo. 
Let P be a finitely discrete probability measure. Define (f>o . . . , F k ) = 
Wi, ■••»/*): (/i,-- -Jk) € Ji x •■• x F k }. Then 

/ k \ k 

(33) N ( ec £M i ,0o(Ji,...,^,L 2 (P) <nJV(eM i5 ^-,L 2 (P)). 

V 3=1 ' 3=1 
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Proof. The proof is similar to the proof of Kosorok (2008), Lemma 9.13. 
Let /, g G F\ x • • • x Fk satisfy \\fj- gj \\p,2< zMj for 1 < j < k. Note that 



which implies (33). □ 

The following two corollaries are a direct result of Lemma A.l: 

Corollary A. 2. Let K. = {K:K is monotone decreasing K:[0,t] h-> 
[i^ m i n ,l]}. Define IC~ 1 = {1/K:K G JC}. Let P be a finitely discrete proba- 
bility measure. Then 



Proof. Note that inequality (32) holds for k = 1 and c = K m l n , and the 
results follow from Lemma A.l. □ 

Corollary A. 3. Let Q C {Q(x, a) : x G W, a G {1, . . . , k}, \\Q\\oo <M}. 
Define Q max = {max a Q(x, a):Q G Q}. Let P 6e a finitely discrete probability 
measure. Then 



Proof. Since (max a /i(a) — max a h'(a)) 2 < max a (/i(a) — h'(a)) 2 , in- 
equality (32) holds for c = 1. The results now follow from Lemma A.l. □ 

We also need the following lemma and its corollary: 

Lemma A. 4. Let F\ and F2 be two function classes uniformly bounded 
in absolute value by M\ and M2, respectively. Define F\ ■ F2 = {/1 • fi : 



fi G Fi}. Then 

N(2eM 1 M 2 ,F 1 -F 2 ,L 2 (P) ) < N{eM 1 ,Ft,L 2 (P)) • N(eM 2 ,F 2 ,L 2 (P) ) . 
PROOF. Let \\fj — gj\\p,2 < eMj where fj,gj eFj, j = {1,2}. Note that 

\\h ■ h -gi -g2\\p,2 < ||/i(/2-52)||p,2 + ||52(/i -gi)\\p,2 



The result follows. □ 

Corollary A. 5. Let Q be a function class uniformly bounded in abso- 
lute value by M . Define Q 2 = {g 2 :g G Q}. Then 

N(2eM 2 ,g 2 ,L2(P))<N(eM,g,L 2 (P)) 2 . 




k 

E 



N(eK^ n ,JC- 1 ,L2(P))<N(e,IC,L2(P)). 



N(ekM, Q max , L 2 (P)) < N(eM, Q, L 2 (P)) k . 



< Mi 1 1 f 2 - g 2 1 1 p,2 + M 2 1 1 /1 - gi \ | P , 2 < 2M X M 2 e . 
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Proof. Apply Lemma A. 4 with T\ = T 2 = Q- □ 

We use the previous results to prove the following lemma: 

Lemma A. 6. Let 

Qt C {Q t {x,a):x G M. Pt ,a G {1, . . -,k}, ||Qt||oo < M}, 
fC = {K : K is monotone decreasingK : [0, r] 1— > \K m fa, 

Tt= |^^y( r + m&xQ t+ i(x,a) -Qt(x,a)^j : r G [0,r], 

Qt£Qt,Qt+ieQt+uKe}c\, 

where t G 1, . .. ,T and Qt+i = {0}. Assume that the uniform entropy bound 
for each of the spaces Qt (14) holds. Then: 

(1) There are constants D' andW' such that log N(e,K,L 2 (P)) < D'{\) w 
where W' = max{W, 1}. 

(2) For every a > and t>0, 

P*(sup\\Ef-E n f\\>t)<C a exp{C b Vn~t u+a -2nt 2 }, 

where U = W'(6 — W')/{2 + W'), the constants C a and C\, depend only on D' , 
W and a, and where P* is outer probability. 

Proof. Let W' = m&x{W, 1}. Note that uniform entropy bound (14) for 
the spaces Q t holds also for W' . Note that by Corollary A.3, log N(eM, Q™ ax , 
L 2 {P)) < Dk w ' +1 (\) w '. Since (x + y + z) 2 < 3(x 2 + y 2 + z 2 ), we can apply 
Lemma A.l to the class 

Q = j r + maxQ t+1 (x,a) - Q t (x,a):r G [0,T],Q t G Qt,Qt+i G Qt+i} 

with c = y/3 and (j)(x,y,z) = x + y + z to obtain log N(\^3e(2M + t),Q, 
L 2 (P)) < (r + Dk w +1 + Z))e _H/ , where we used the fact that the segment 
[0, r] can be covered by no more than t/e+1 balls of radius e and that log(l + 
r/e) < r/e. By Corollary A.5, we have logiV(2 • 3e(2M + r) 2 ,g 2 , L 2 (P)) < 
2(r + ,Dfc w/,+1 + £>)(|) W ' or, equivalently, 

n\ w ' 

logN(eM 1 ,g 2 ,L 2 (P))<D 1 i-\ , 

where M\ = (2M + r) 2 is a uniform bound for Q 2 , and D\ = 2(r + DA; 117 ' 4 " 1 + 
D)6- M/ '. 
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By Kosorok [(2008), Lemma 9.11], log N (e, K,, L 2 (P)) < D 2 e^ for some 
universal constant D 2 which is independent of the choice of probability mea- 
sure P. By Corollary A. 2, 



Since this inequality holds for every finitely discrete probability measure P, 
assertion (1) is proved. The second assertion follows from van der Vaart and 
Wellner (1996), Theorem 2.14.10. □ 
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SUPPLEMENTARY MATERIAL 

Code and data sets (DOI: 10. 1214/12- AOS968SUPP; .zip). Please read 
the file README.pdf for details on the files in this folder. 
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